feat(retrieval): PageIndex-style page-based agentic strategy (PR-B)#25

Merged
hallelx2 merged 3 commits into main from feat/pageindex-strategy
May 27, 2026

@hallelx2 (Owner) commented May 27, 2026

Why

FinanceBench's debt-registration question scores 0/1 on our current section-based retrieval against a 508-node 10-K outline — the "pick a section_id" surface is too noisy. PageIndex hits 98.7% on the same benchmark with a smaller interface: 3 tools, page-range navigation, no embeddings.

This PR ports that interface to vectorless-engine as a new strategy + dedicated answer endpoint. The existing endpoints are unchanged; PageIndex is an opt-in, additive surface.

What ships

1. PageIndexStrategy (pkg/retrieval/pageindex_strategy.go)

A new Strategy + CostStrategy implementing a faithful port of PageIndex's three-tool reasoning loop:

  • get_document_structure() — returns the TOC tree as JSON (titles + page ranges, no body text).
  • get_pages(start_page, end_page) — returns the concatenated content of every section whose [PageStart, PageEnd] overlaps the requested range, clipped at PageContentLimit.
  • done(answer, cited_pages, reasoning) — terminates with the natural-language answer and the inclusive page ranges the answer relies on.
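
The overlap-and-clip behaviour described for get_pages can be sketched as follows. This is a minimal illustration, not the code in pkg/retrieval/pageindex_strategy.go: the Section type, the getPages name, the blank-line separator, and clipping by bytes rather than characters are all assumptions made for the example.

```go
package main

import (
	"fmt"
	"strings"
)

// Section is a hypothetical stand-in for a node in the section tree.
type Section struct {
	ID        string
	PageStart int // inclusive
	PageEnd   int // inclusive
	Content   string
}

// getPages concatenates the content of every section whose inclusive
// [PageStart, PageEnd] range overlaps [start, end], then clips the result to
// at most limit bytes. It also returns the IDs of the sections it touched.
// Clipping by bytes can split a multi-byte rune; a real implementation may
// prefer to clip on a rune boundary.
func getPages(sections []Section, start, end, limit int) (string, []string) {
	var b strings.Builder
	var ids []string
	for _, s := range sections {
		// Two inclusive ranges overlap when each starts before the other ends.
		if s.PageStart <= end && start <= s.PageEnd {
			if b.Len() > 0 {
				b.WriteString("\n\n")
			}
			b.WriteString(s.Content)
			ids = append(ids, s.ID)
		}
	}
	out := b.String()
	if limit > 0 && len(out) > limit {
		out = out[:limit]
	}
	return out, ids
}

func main() {
	sections := []Section{
		{ID: "s1", PageStart: 1, PageEnd: 4, Content: "Business overview."},
		{ID: "s2", PageStart: 4, PageEnd: 9, Content: "Risk factors."},
		{ID: "s3", PageStart: 10, PageEnd: 12, Content: "Debt securities."},
	}
	text, ids := getPages(sections, 4, 5, 16000)
	fmt.Println(ids)
	fmt.Println(text)
}
```

A request for pages 4 to 5 touches both s1 and s2 here, because page 4 belongs to both ranges; that is the partial-overlap case the unit tests mention.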

The system prompt is a port of the reference PageIndex demo (PageIndex/examples/agentic_vectorless_rag_demo.py:44-52) adapted to vle's JSON-action protocol (llmgate v0.2.0's Tools field is still scaffolding-only). When llmgate wires native tool calling, the action surface is unchanged.
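
To make the JSON-action protocol concrete, here is a hedged sketch of decoding one model turn. The Action struct, the parseAction name, and the wire field names other than the three tool names are assumptions for illustration; the real parser lives in the strategy file and may differ.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
)

// Action is a hypothetical in-memory form of one model turn.
type Action struct {
	Tool      string
	StartPage int
	EndPage   int
	Answer    string
}

// parseAction decodes a single JSON object emitted by the model. The tool
// name is accepted under either the "tool" or the "action" key.
func parseAction(raw string) (Action, error) {
	var wire struct {
		Tool      string `json:"tool"`
		Action    string `json:"action"`
		StartPage int    `json:"start_page"`
		EndPage   int    `json:"end_page"`
		Answer    string `json:"answer"`
	}
	if err := json.Unmarshal([]byte(raw), &wire); err != nil {
		return Action{}, fmt.Errorf("bad action JSON: %w", err)
	}
	name := wire.Tool
	if name == "" {
		name = wire.Action
	}
	switch name {
	case "get_document_structure", "get_pages", "done":
		return Action{
			Tool:      name,
			StartPage: wire.StartPage,
			EndPage:   wire.EndPage,
			Answer:    wire.Answer,
		}, nil
	case "":
		return Action{}, errors.New("action has no tool name")
	default:
		return Action{}, fmt.Errorf("unknown tool %q", name)
	}
}

func main() {
	a, err := parseAction(`{"action":"get_pages","start_page":12,"end_page":14}`)
	fmt.Println(a.Tool, a.StartPage, a.EndPage, err)
}
```

Because the loop only sees an Action value, swapping the text protocol for native tool calling later would replace parseAction and leave the dispatch untouched, which matches the claim that the action surface stays the same.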

Graceful degradation: the strategy uses a TOCProvider interface for get_document_structure observations. When the persisted documents.toc_tree column is NULL (pre-PR-A state), the provider's ErrNoTOC signal triggers a synthesised view derived from the section tree. Until PR-A merges, every request takes this fallback path, which is expected: the strategy does not depend on a persisted TOC.
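
The fallback can be sketched roughly as below. TOCProvider and ErrNoTOC are named in this PR, but their definitions here are guesses: the context parameter is dropped for brevity, and Section, nullTOC, staticTOC, and structureView are hypothetical names used only for this example.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
	"log"
)

// ErrNoTOC mirrors the sentinel named in the PR; this definition is assumed.
var ErrNoTOC = errors.New("no persisted TOC")

// TOCProvider is a simplified guess at the interface (no context argument).
type TOCProvider interface {
	GetTOC(documentID string) ([]byte, error)
}

// Section is a hypothetical section-tree node carrying only TOC fields.
type Section struct {
	Title     string `json:"title"`
	StartPage int    `json:"start_page"`
	EndPage   int    `json:"end_page"`
}

// nullTOC simulates a document whose toc_tree column is NULL.
type nullTOC struct{}

func (nullTOC) GetTOC(string) ([]byte, error) { return nil, ErrNoTOC }

// staticTOC simulates a persisted TOC.
type staticTOC string

func (s staticTOC) GetTOC(string) ([]byte, error) { return []byte(s), nil }

// structureView returns the persisted TOC when one exists; otherwise it
// synthesises a view from the section tree. The bool reports which path ran.
func structureView(p TOCProvider, docID string, sections []Section) (string, bool, error) {
	if p != nil {
		raw, err := p.GetTOC(docID)
		if err == nil && len(raw) > 0 {
			return string(raw), false, nil
		}
		// ErrNoTOC is the expected signal; anything else is logged, and the
		// request still degrades instead of failing.
		if err != nil && !errors.Is(err, ErrNoTOC) {
			log.Printf("TOC fetch failed, using synthesised view: %v", err)
		}
	}
	synth, err := json.Marshal(sections)
	if err != nil {
		return "", true, err
	}
	return string(synth), true, nil
}

func main() {
	sections := []Section{{Title: "Item 1. Business", StartPage: 1, EndPage: 3}}
	view, synthesised, err := structureView(nullTOC{}, "doc-1", sections)
	fmt.Println(view, synthesised, err)
}
```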

Result.Reasoning carries the agent's final answer (/v1/answer/pageindex reads it directly). Result.SelectedIDs is the union of every section whose page range overlaps any cited range, so the existing /v1/query callers still get a section list. A new Result.PagesRead []PageReadEntry records every get_pages call (start/end/section_ids/char_count) for cost debugging and the reasoning trace.

2. POST /v1/answer/pageindex (internal/api/pageindex.go)

  • One round-trip: retrieval + answer + citations come back from a single agentic loop. No separate synthesis call — the model writes its answer inside the done action.
  • Trace token: the strategy's computePageIndexTraceToken hashes doc_id || "pageindex:"+model || sorted cited page ranges, folding the strategy name into the model position so page-based and section-based tokens never collide. Responses are stored in the existing replay store, so /v1/replay returns byte-identical responses.
  • Per-page-range citations with answer-span quotes pulled via the existing SpanExtractor over the concatenated cited content (offsets back into that content).
  • reasoning_trace (opt-in via body reasoning:true or ?reasoning=true) lists every tool call with hop/tool/args/result_chars/sections_touched. Captured via a new OnEvent hook on PageIndexStrategy.
  • Streaming (stream:true) via Server-Sent Events. One event per tool call so callers watch the navigation in real time, terminated by an answer event carrying the full payload.
  • Per-request overrides for max_hops and max_pages_per_fetch without mutating shared Deps.
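
One way to apply per-request overrides without mutating shared state is to copy the shared strategy value and override only the copy. The sketch below assumes a trimmed-down PageIndexStrategy with two knobs and a hypothetical Overrides type; the real struct has more fields and the handler may do this differently.

```go
package main

import "fmt"

// PageIndexStrategy is a hypothetical, trimmed-down version of the strategy
// holding only the two knobs that can be overridden per request.
type PageIndexStrategy struct {
	MaxHops          int
	MaxPagesPerFetch int
}

// Overrides uses pointers so that "not provided" differs from zero.
type Overrides struct {
	MaxHops          *int
	MaxPagesPerFetch *int
}

// withOverrides returns a per-request copy of the shared strategy. The shared
// instance is never written to, so concurrent requests cannot observe each
// other's settings. Note this is a shallow copy: any pointer, map, or slice
// fields in a fuller struct would still be shared.
func withOverrides(shared *PageIndexStrategy, o Overrides) *PageIndexStrategy {
	perReq := *shared
	if o.MaxHops != nil && *o.MaxHops > 0 {
		perReq.MaxHops = *o.MaxHops
	}
	if o.MaxPagesPerFetch != nil && *o.MaxPagesPerFetch > 0 {
		perReq.MaxPagesPerFetch = *o.MaxPagesPerFetch
	}
	return &perReq
}

func main() {
	shared := &PageIndexStrategy{MaxHops: 8, MaxPagesPerFetch: 10}
	hops := 2
	perReq := withOverrides(shared, Overrides{MaxHops: &hops})
	fmt.Println(perReq.MaxHops, shared.MaxHops)
}
```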

3. Config (pkg/config/config.go)

  • New RetrievalConfig.PageIndex block: enabled (default true), max_hops (8), page_content_limit (16000), model (inherit).
  • VLE_RETRIEVAL_PAGEINDEX_* env overrides (Enabled/MaxHops/PageContentLimit/Model).
  • Validate() accepts pageindex as a strategy name and rejects negative knobs.
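
The defaults, environment overrides, and validation listed above could look something like this. The defaults come from the PR text; the struct field names, the exact environment variable suffixes (ENABLED, MAX_HOPS, PAGE_CONTENT_LIMIT, MODEL), and the error messages are assumptions, so check pkg/config/config.go for the real spellings.

```go
package main

import (
	"errors"
	"fmt"
	"os"
	"strconv"
)

// PageIndexBlock mirrors the four knobs listed above. Field names are assumed.
type PageIndexBlock struct {
	Enabled          bool
	MaxHops          int
	PageContentLimit int
	Model            string // empty means "inherit"
}

// defaults returns the default values stated in the PR description.
func defaults() PageIndexBlock {
	return PageIndexBlock{Enabled: true, MaxHops: 8, PageContentLimit: 16000}
}

// applyEnv overlays environment overrides. The lookup function is injected
// (os.LookupEnv in production) so the logic can be tested without touching
// the process environment. The variable suffixes are assumptions.
func applyEnv(c *PageIndexBlock, lookup func(string) (string, bool)) error {
	const prefix = "VLE_RETRIEVAL_PAGEINDEX_"
	if v, ok := lookup(prefix + "ENABLED"); ok {
		b, err := strconv.ParseBool(v)
		if err != nil {
			return fmt.Errorf("%sENABLED: %w", prefix, err)
		}
		c.Enabled = b
	}
	for name, dst := range map[string]*int{
		"MAX_HOPS":           &c.MaxHops,
		"PAGE_CONTENT_LIMIT": &c.PageContentLimit,
	} {
		if v, ok := lookup(prefix + name); ok {
			n, err := strconv.Atoi(v)
			if err != nil {
				return fmt.Errorf("%s%s: %w", prefix, name, err)
			}
			*dst = n
		}
	}
	if v, ok := lookup(prefix + "MODEL"); ok {
		c.Model = v
	}
	return nil
}

// Validate rejects negative knobs.
func (c PageIndexBlock) Validate() error {
	if c.MaxHops < 0 {
		return errors.New("retrieval.pageindex.max_hops must not be negative")
	}
	if c.PageContentLimit < 0 {
		return errors.New("retrieval.pageindex.page_content_limit must not be negative")
	}
	return nil
}

func main() {
	cfg := defaults()
	if err := applyEnv(&cfg, os.LookupEnv); err != nil {
		fmt.Println("config error:", err)
		return
	}
	if err := cfg.Validate(); err != nil {
		fmt.Println("config error:", err)
		return
	}
	fmt.Printf("%+v\n", cfg)
}
```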

4. Wiring (cmd/engine/main.go)

  • buildStrategy registers pageindex as a selection strategy choice.
  • A dedicated PageIndexStrategy instance is wired into api.Deps.PageIndexStrategy whenever retrieval.pageindex.enabled is true, regardless of which strategy is selected as the default, so a deployment running chunked-tree for /v1/query still gets /v1/answer/pageindex.

5. OpenAPI + config.example.yaml

Full spec for the new endpoint: PageIndexAnswerRequest, PageIndexAnswerResponse, PageIndexCitation, PageReadEntry, PageIndexTraceEntry. Both application/json and text/event-stream content types under 200, with SSE event type documentation. Example config block with operator-readable comments.

Test plan

  • pkg/retrieval/pageindex_strategy_test.go — 15 unit tests: canonical 3-tool sequence, multi-range citations, MaxHops force-done (with and without recovery), TOC fallback (and persisted-TOC precedence), persistent bad JSON, out-of-range + partial-overlap page clamping, empty tree, loader-less degradation, content clipping, empty-citations refusal, trace-token stability + order invariance, parser tolerance.
  • internal/api/pageindex_test.go — 12 end-to-end handler tests via httptest with a mock LLM, mock storage, and a PageIndexTreeLoader test seam: happy path, reasoning trace (body + query param), bad request, document not found, disabled (config + nil strategy), no LLM, replay persistence verifying byte-equal response bytes, SSE event stream shape, per-request override caps the loop, TOC fallback.
  • pkg/config/config_test.go — 5 config tests: defaults, env overrides (all four knobs), enable-toggle from disabled, garbage env rejection, validation negatives.
  • Full go test ./... and go build ./... clean.
  • config.example.yaml parses cleanly via config.Load.
  • Existing tests unchanged.

Risk envelope

  • Opt-in at the request level. Existing endpoints (/v1/query, /v1/answer, /v1/replay) are unchanged. The new /v1/answer/pageindex is purely additive.
  • Works without PR-A. The strategy falls back to a synthesised TOC view when documents.toc_tree is NULL. Even if PR-A is never merged, this PR delivers value.
  • Test coverage gates merge. 32 new tests; existing tests still pass.

Out of scope (NOT in this PR)

  • TOC tree builder (pkg/tree/tree.go TOCNode + ingest stage). PR-A owns that. The TOCProvider interface is the integration point — when PR-A lands, the engine wires a DB-backed implementation reading documents.toc_tree.

hallelx2 added 3 commits May 27, 2026 17:21
Add a new retrieval Strategy modelled on PageIndex's 3-tool
reasoning protocol (get_document_structure, get_pages, done). The
model navigates by inclusive page range rather than by section ID
— a tighter interface for paginated documents (SEC filings,
academic PDFs) where the prior "pick a section ID from a 500-node
outline" surface was too noisy.

The loop:

  - get_document_structure() returns the document's TOC as JSON
    (titles + page ranges, no body text). Wires to a TOCProvider
    that reads documents.toc_tree when present; falls back to a
    synthesised view derived from the section tree when not, so
    the strategy works even before the TOC-builder PR lands.

  - get_pages(start_page, end_page) returns concatenated content
    of every section whose [PageStart, PageEnd] overlaps the
    requested range, clipped to PageContentLimit chars.

  - done(answer, cited_pages, reasoning) terminates with the
    final answer + the page ranges the answer relies on.

SelectWithCost surfaces both the agent's literal answer string
(via Result.Reasoning) and the set of section IDs whose page
range overlaps any cited range (via Result.SelectedIDs), so the
existing /v1/query + /v1/answer callers can consume the strategy
without changes. A new PagesRead field on Result captures every
get_pages call (start/end/section IDs/char count) for cost
debugging and the reasoning-trace surface.

Protocol uses the same JSON-action text shape AgenticStrategy
proved (llmgate v0.2.0's Tools field is still scaffolding-only);
when llmgate wires native tool calling the surface here is
unchanged. The parser tolerates "tool" vs "action" keys, a
"5-7"-string Pages alternative, and string-shaped cited_pages.
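
As an illustration of that input tolerance, a page range can implement json.Unmarshaler and accept more than one shape. This sketch assumes the canonical form is a two-element array and that the string form is "start-end" (or a single page number); the PageRange type and parseRanges helper are hypothetical, and the real parser may accept further shapes.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strconv"
	"strings"
)

// PageRange is a hypothetical inclusive page range.
type PageRange struct{ Start, End int }

// valid reports whether the range is 1-based and non-inverted.
func (r PageRange) valid() error {
	if r.Start < 1 || r.End < r.Start {
		return fmt.Errorf("invalid page range %d-%d", r.Start, r.End)
	}
	return nil
}

// UnmarshalJSON accepts either [start, end] or a string such as "5-7" or "5".
func (r *PageRange) UnmarshalJSON(data []byte) error {
	var pair []int
	if err := json.Unmarshal(data, &pair); err == nil {
		if len(pair) != 2 {
			return fmt.Errorf("page range needs 2 elements, got %d", len(pair))
		}
		r.Start, r.End = pair[0], pair[1]
		return r.valid()
	}
	var s string
	if err := json.Unmarshal(data, &s); err != nil {
		return fmt.Errorf("page range must be an array or a string: %w", err)
	}
	lo, hi, found := strings.Cut(s, "-")
	if !found {
		hi = lo // a single page such as "20" means 20-20
	}
	start, err := strconv.Atoi(strings.TrimSpace(lo))
	if err != nil {
		return fmt.Errorf("bad start page in %q: %w", s, err)
	}
	end, err := strconv.Atoi(strings.TrimSpace(hi))
	if err != nil {
		return fmt.Errorf("bad end page in %q: %w", s, err)
	}
	r.Start, r.End = start, end
	return r.valid()
}

// parseRanges decodes a JSON array of ranges in any accepted shape.
func parseRanges(raw string) ([]PageRange, error) {
	var out []PageRange
	if err := json.Unmarshal([]byte(raw), &out); err != nil {
		return nil, err
	}
	return out, nil
}

func main() {
	ranges, err := parseRanges(`[[5,7],"12-14","20"]`)
	fmt.Println(ranges, err)
}
```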

Trace-token reuses ComputeTraceToken but folds the strategy name
into the model position so page-based and section-based runs on
the same doc/model don't collide, and tags the page ranges with
"p:" so they share namespace with section IDs without colliding.
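
The token scheme can be sketched as below. The PR says the real code reuses ComputeTraceToken; this example does not reproduce it. SHA-256, the NUL separator, the comma join, and the traceToken and PageRange names are all assumptions chosen to show the two properties that matter: order invariance and a namespace distinct from section-based tokens.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"sort"
	"strings"
)

// PageRange is a hypothetical inclusive page range.
type PageRange struct{ Start, End int }

// traceToken derives a deterministic token from the document, the model, and
// the cited page ranges. Ranges are tagged with "p:" and sorted, so the order
// in which the model cited them does not change the token. The strategy name
// is folded into the model position via the "pageindex:" prefix.
func traceToken(docID, model string, cited []PageRange) string {
	ids := make([]string, 0, len(cited))
	for _, r := range cited {
		ids = append(ids, fmt.Sprintf("p:%d-%d", r.Start, r.End))
	}
	sort.Strings(ids)

	h := sha256.New()
	for _, part := range []string{docID, "pageindex:" + model, strings.Join(ids, ",")} {
		h.Write([]byte(part))
		h.Write([]byte{0}) // separator so field boundaries are unambiguous
	}
	return hex.EncodeToString(h.Sum(nil))
}

func main() {
	a := traceToken("doc-1", "some-model", []PageRange{{5, 7}, {12, 14}})
	b := traceToken("doc-1", "some-model", []PageRange{{12, 14}, {5, 7}})
	fmt.Println(a == b)
}
```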

15 unit tests cover: the happy 3-tool sequence, multi-range
citations, MaxHops force-done (both with and without recovery),
TOC fallback, persisted-TOC precedence, persistent bad JSON,
out-of-range and partial-overlap page clamping, empty tree,
loader-less degradation, content clipping, empty-citations
refusal, trace-token stability + order invariance, and parser
tolerance for every documented input shape.

Wire the PageIndex strategy through a dedicated answer endpoint
on the existing /v1 router. The endpoint:

  - Owns the full RAG round-trip in one request: retrieval +
    answer + citations come back from a single agentic loop.
    No separate synthesis call — the model emits its answer
    inside the done action and we surface it as `answer` on
    the response.

  - Emits page-grounded citations. One citation per page range
    the agent fetched (deduplicated), each carrying
    start_page / end_page / section_ids plus an answer-span
    quote pulled via the existing SpanExtractor over the cited
    content. Falls back gracefully when the LLM declines a
    quote.

  - Persists every successful response to the existing replay
    store under the strategy's deterministic trace_token. The
    token's input set is sorted cited page ranges (not section
    IDs), and the strategy name is folded into the hash so
    page-based and section-based tokens for the same doc/model
    never collide.

  - Supports an opt-in reasoning trace (body field
    `reasoning:true` or query param `?reasoning=true`) that
    surfaces per-hop tool calls + args + tool-result chars +
    sections touched, captured via a new OnEvent hook on
    PageIndexStrategy.

  - Streams via Server-Sent Events when `stream:true` is set
    on the body. One event per tool call (get_document_structure,
    get_pages, done) so callers can watch the navigation in real
    time, terminated by an `answer` event carrying the full
    JSON response payload.

  - Honors per-request overrides for max_hops and
    max_pages_per_fetch without mutating shared Deps. Disabled
    deployments (retrieval.pageindex.enabled=false or no LLM
    client) return 501; missing documents 404; bad bodies 400.
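
For readers unfamiliar with the streaming shape, here is a minimal sketch of emitting one Server-Sent Event per tool call followed by a terminal answer event. The event names follow this PR; the payload fields, the writeSSE and streamDemo names, and the fixed event list are assumptions standing in for the real agentic loop.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"net/http/httptest"
)

// writeSSE writes one event in text/event-stream framing and flushes it so
// the client sees it immediately. json.Marshal output contains no raw
// newlines, so a single "data:" line is enough.
func writeSSE(w http.ResponseWriter, event string, payload any) error {
	data, err := json.Marshal(payload)
	if err != nil {
		return err
	}
	if _, err := fmt.Fprintf(w, "event: %s\ndata: %s\n\n", event, data); err != nil {
		return err
	}
	if f, ok := w.(http.Flusher); ok {
		f.Flush()
	}
	return nil
}

// streamDemo replays a hypothetical three-hop run as SSE events.
func streamDemo(w http.ResponseWriter, _ *http.Request) {
	w.Header().Set("Content-Type", "text/event-stream")
	w.Header().Set("Cache-Control", "no-cache")
	events := []struct {
		name    string
		payload map[string]any
	}{
		{"started", map[string]any{"document_id": "doc-1"}},
		{"get_document_structure", map[string]any{"hop": 1}},
		{"get_pages", map[string]any{"hop": 2, "start_page": 12, "end_page": 14}},
		{"done", map[string]any{"hop": 3}},
		{"answer", map[string]any{"answer": "see pages 12-14"}},
	}
	for _, e := range events {
		if err := writeSSE(w, e.name, e.payload); err != nil {
			return // client went away; stop streaming
		}
	}
}

// render runs the handler against an in-memory recorder and returns the body.
func render() string {
	rec := httptest.NewRecorder()
	req := httptest.NewRequest(http.MethodPost, "/v1/answer/pageindex", nil)
	streamDemo(rec, req)
	return rec.Body.String()
}

func main() {
	fmt.Print(render())
}
```

Flushing after every event is what lets a caller watch the navigation hop by hop instead of receiving everything when the handler returns.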

Adds `RetrievalConfig.PageIndex` (PageIndexBlock) with defaults
(Enabled=true, MaxHops=8, PageContentLimit=16000) and matching
VLE_RETRIEVAL_PAGEINDEX_* env overrides. Validation rejects
negative knobs and accepts "pageindex" as a retrieval strategy.

cmd/engine/main.go registers the strategy via buildStrategy
when retrieval.strategy=pageindex, AND wires a standalone
PageIndexStrategy instance into the api.Deps used by the
answer endpoint — so the endpoint is available regardless of
which selection strategy the deployment runs by default.

Test coverage: 12 end-to-end handler tests (happy path,
reasoning trace via body field + query param, bad request,
not found, disabled in two modes, no LLM, replay persistence
verifying byte-equal response bytes, SSE event stream shape,
per-request override caps the loop, TOC fallback). Plus 5
config tests for defaults + env overrides + validation.

A PageIndexTreeLoader function field on Deps acts as a test
seam so handler tests can run end-to-end via httptest with
an in-memory tree, without a real Postgres backend.

OpenAPI 3.1 spec for the new endpoint:

  - POST /v1/answer/pageindex documented with the
    PageIndexAnswerRequest body shape (document_id, query,
    optional model, max_hops, max_pages_per_fetch, stream,
    reasoning) and PageIndexAnswerResponse (answer,
    citations, hops_taken, usage, trace_token, pages_read,
    reasoning_trace).

  - PageIndexCitation, PageReadEntry, and
    PageIndexTraceEntry component schemas describe the
    page-grounded citation shape, the per-call navigation
    footprint, and per-hop reasoning trace entries.

  - The 200 response carries content for BOTH
    application/json (non-streaming) and text/event-stream
    (when stream:true) with documentation of the SSE event
    types: `started`, one event per tool call
    (get_document_structure / get_pages / done), and a
    terminal `answer` event carrying the full payload.

  - 501 covers both "no LLM client" and
    "retrieval.pageindex.enabled=false" so operators
    looking at the spec see the toggle that disables the
    endpoint.

  - QueryResponse's strategy enum gains "pageindex" so
    /v1/query responses returned by a pageindex-default
    deployment validate against the schema.

  - ?reasoning=true query parameter is documented as an
    alternative to the body's reasoning field.

config.example.yaml:

  - retrieval.strategy comment lists every available
    strategy with a one-line description of each, so an
    operator picking a strategy can see what they're
    choosing between without reading code.

  - New retrieval.pageindex block with enabled / max_hops /
    page_content_limit / model knobs, default values
    matching the engine defaults, and a comment block
    explaining the three-tool loop, the trace_token /
    reasoning_trace / streaming differentiators, and the
    graceful-degradation behaviour when no TOC tree is
    persisted yet (the synthesised view fallback).
Copilot AI review requested due to automatic review settings May 27, 2026 16:42

@sourcery-ai sourcery-ai Bot left a comment

Sorry @hallelx2, you have reached your weekly rate limit of 500000 diff characters.

Please try again later or upgrade to continue using Sourcery

@coderabbitai

coderabbitai Bot commented May 27, 2026

Warning

Review limit reached

@hallelx2, we couldn't start this review because you've reached your PR review rate limit.

More reviews will be available in 45 minutes and 2 seconds. Learn how PR review limits work.

Your organization has run out of usage credits. Purchase more in the billing tab.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 46ea6261-7690-41ef-943f-cf11a990580e

📥 Commits

Reviewing files that changed from the base of the PR and between 28ffc33 and 432524c.

📒 Files selected for processing (11)
  • cmd/engine/main.go
  • config.example.yaml
  • internal/api/pageindex.go
  • internal/api/pageindex_test.go
  • internal/api/server.go
  • openapi.yaml
  • pkg/config/config.go
  • pkg/config/config_test.go
  • pkg/retrieval/pageindex_strategy.go
  • pkg/retrieval/pageindex_strategy_test.go
  • pkg/retrieval/strategy.go
@hallelx2 hallelx2 merged commit e183ca7 into main May 27, 2026
5 of 9 checks passed
@hallelx2 hallelx2 deleted the feat/pageindex-strategy branch May 27, 2026 16:44

Copilot AI left a comment

Pull request overview

This PR adds an opt-in PageIndex-style page-range retrieval/answering path alongside the existing section-based retrieval APIs. It introduces a new page-based agentic strategy, a dedicated /v1/answer/pageindex endpoint, config/wiring, tests, and OpenAPI documentation.

Changes:

  • Added PageIndexStrategy with JSON tool-call loop, page reads, trace token support, TOC fallback, and tests.
  • Added /v1/answer/pageindex handler with JSON/SSE responses, reasoning trace, citations, and replay integration.
  • Added PageIndex config, engine wiring, OpenAPI schemas, and example configuration.

Reviewed changes

Copilot reviewed 11 out of 11 changed files in this pull request and generated 5 comments.

| File | Description |
| --- | --- |
| pkg/retrieval/strategy.go | Adds PagesRead metadata to retrieval results. |
| pkg/retrieval/pageindex_strategy.go | Implements the new PageIndex page-based strategy. |
| pkg/retrieval/pageindex_strategy_test.go | Adds unit coverage for strategy behavior and parsing. |
| pkg/config/config.go | Adds PageIndex config defaults, env overrides, and validation. |
| pkg/config/config_test.go | Adds config tests for PageIndex settings. |
| openapi.yaml | Documents the new endpoint and schemas. |
| internal/api/server.go | Wires the new route and API dependencies. |
| internal/api/pageindex.go | Implements the PageIndex answer endpoint and SSE path. |
| internal/api/pageindex_test.go | Adds handler tests for JSON/SSE/replay/error paths. |
| config.example.yaml | Documents PageIndex configuration. |
| cmd/engine/main.go | Wires PageIndex as a selectable strategy and dedicated endpoint strategy. |

Comment thread internal/api/pageindex.go
Comment on lines +301 to +308
// Build a citation per UNIQUE page range present in PagesRead.
// The set of pages the model "read" is a superset of what it
// cited — some get_pages calls don't end up in the final
// cited_pages list — but the union is the right cone of trust
// to surface as evidence. The trace token is computed over
// only the strictly-cited ranges, which the strategy already
// has, so citation drift doesn't break replay.
seen := make(map[[2]int]struct{}, len(res.PagesRead))
Comment thread internal/api/pageindex.go
Comment on lines +263 to +283
    citations := d.buildPageIndexCitations(r.Context(), t, res, body.Query, body.Model)
    final := map[string]any{
        "document_id": body.DocumentID,
        "query":       body.Query,
        "answer":      res.Reasoning,
        "citations":   citations,
        "strategy":    strat.Name(),
        "model":       budget.ModelName,
        "hops_taken":  res.HopsTaken,
        "usage": map[string]any{
            "input_tokens":  res.Usage.InputTokens,
            "output_tokens": res.Usage.OutputTokens,
            "total_tokens":  res.Usage.TotalTokens,
            "cost_usd":      res.Usage.CostUSD,
            "llm_calls":     res.Usage.LLMCalls,
        },
        "elapsed_ms":  time.Since(started).Milliseconds(),
        "trace_token": res.TraceToken,
        "pages_read":  res.PagesRead,
    }
    emitSSE("answer", final)
Comment thread internal/api/pageindex.go
Comment on lines +171 to +188
    resp := map[string]any{
        "document_id": body.DocumentID,
        "query":       body.Query,
        "answer":      res.Reasoning, // strategy stores the agent's answer here
        "citations":   citations,
        "strategy":    perReq.Name(),
        "model":       budget.ModelName,
        "hops_taken":  res.HopsTaken,
        "usage": map[string]any{
            "input_tokens":  res.Usage.InputTokens,
            "output_tokens": res.Usage.OutputTokens,
            "total_tokens":  res.Usage.TotalTokens,
            "cost_usd":      res.Usage.CostUSD,
            "llm_calls":     res.Usage.LLMCalls,
        },
        "elapsed_ms":  time.Since(started).Milliseconds(),
        "trace_token": res.TraceToken,
        "pages_read":  res.PagesRead,
Comment thread internal/api/pageindex.go
Comment on lines +264 to +281
    final := map[string]any{
        "document_id": body.DocumentID,
        "query":       body.Query,
        "answer":      res.Reasoning,
        "citations":   citations,
        "strategy":    strat.Name(),
        "model":       budget.ModelName,
        "hops_taken":  res.HopsTaken,
        "usage": map[string]any{
            "input_tokens":  res.Usage.InputTokens,
            "output_tokens": res.Usage.OutputTokens,
            "total_tokens":  res.Usage.TotalTokens,
            "cost_usd":      res.Usage.CostUSD,
            "llm_calls":     res.Usage.LLMCalls,
        },
        "elapsed_ms":  time.Since(started).Milliseconds(),
        "trace_token": res.TraceToken,
        "pages_read":  res.PagesRead,
Comment on lines +430 to +438
    if s.TOC != nil {
        raw, err := s.TOC.GetTOC(ctx, t.DocumentID)
        if err == nil && len(raw) > 0 {
            return string(raw)
        }
        // Log and degrade — the strategy must keep going.
        if err != nil {
            log.Printf("retrieval: pageindex TOC fetch failed (degrading to synthesised view): %v", err)
        }